Welcome back!!!

Recap of Part 1.

There are a few chapters:

  1. Introduction (Hello World!)
  2. Install R-studio – We successfully installed R-studio and familiarized ourselves with its interface.
  3. Yes, you are ready
  4. Data preparation + visualization
  5. Analysis of the observations
  6. Good luck :)

Part 2 might take a long time, but it will be fun!!

So let’s open R-studio. Behold the visage that greets thine eye.

REMEMBER, if you turned off your Nintendo Switch yesterday and you want to play Super Mario today, you do not need to buy the game again but you still need to load the game card, right? It’s the same with RM’s game: if you want to jump and eat mushrooms, you need to use the appropriate library() calls to start the ‘game’!

Before you run the following code, check the Environment tab in the top right corner (it should say ‘Environment is empty’ before you run). If there is something in the environment, click the broom to clear it. A pop-up will ask you to confirm the removal, and you should say yes. As it tells you, you cannot get the objects in your environment back. But that’s ok because you will have a saved script (code file) to generate any object that you need again.

The Environment tab is a very useful UI that R-studio provides. The R programming language uses object-oriented programming (OOP) concepts, so each data type or “class” can deal with data in a slightly different way. This is like the often-ignored Super Mario 2, where other characters have a different set of abilities than Mario: for example, Luigi can jump higher and farther than Mario.
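
To make the “class” idea concrete, here is a minimal sketch using three made-up objects (the names and values are only for illustration, they are not part of the NBA data). It shows that class() reports what kind of object you have, and that the same function, summary(), reacts differently to each class.

```r
# Three toy objects of different classes (hypothetical examples)
player_name <- "Trae Young"                      # a character string
points      <- c(2L, 3L, 2L)                     # an integer vector
shots       <- data.frame(made = c(TRUE, FALSE, TRUE),
                          dist = c(14, 25, 18))  # a data.frame

# class() tells you which "character" you are playing with
class(player_name)
class(points)
class(shots)

# The same function behaves differently depending on the class
summary(points)  # one numeric summary (min, median, mean, ...)
summary(shots)   # one summary per column
```

These same objects would also show up in the Environment tab, each listed with its own type.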

# Load the tidyverse package (game)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# Load the ggplot2 package (game)
library(ggplot2)
# Load the ggpmisc package (game)
library(ggpmisc)
## Warning: package 'ggpmisc' was built under R version 4.3.3
## Loading required package: ggpp
## Registered S3 methods overwritten by 'ggpp':
##   method                  from   
##   heightDetails.titleGrob ggplot2
##   widthDetails.titleGrob  ggplot2
## 
## Attaching package: 'ggpp'
## 
## The following object is masked from 'package:ggplot2':
## 
##     annotate
## 
## Registered S3 method overwritten by 'ggpmisc':
##   method                  from   
##   as.character.polynomial polynom
# Load the gganimate package (game)
library(gganimate)
# Load the animation package (game)
library(animation)
# Load the kableExtra package (game)
library(kableExtra)
## 
## Attaching package: 'kableExtra'
## 
## The following object is masked from 'package:dplyr':
## 
##     group_rows
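
If one of the library() calls above stops with an error, the package is probably not installed yet. The following is a small sketch (not part of the original script) that checks which of the packages used here are available, using base R’s requireNamespace(). The variable names `needed`, `installed` and `missing_pkgs` are just made up for this example.

```r
# Packages this tutorial loads with library()
needed <- c("tidyverse", "ggpmisc", "gganimate", "animation", "kableExtra")

# requireNamespace() returns TRUE/FALSE instead of stopping with an error
installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
installed

# Names of any packages that still need to be installed
missing_pkgs <- needed[!installed]
missing_pkgs

# Uncomment the next line to install whatever is missing
# install.packages(missing_pkgs)
```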

2. Yes, you are ready

Let me remind you about the goal of MLRS 101

GOAL: learn how to make some results (plots / stats) with R-studio to answer your PI's incessant desire for "Any updates?" 

Soooooo, before we jump into real genomics data to answer their question, let’s start with some popular culture data (basketball) to make some figures and present some stats.

Before we jump into any kind of data, do you know who this is?

It is Atlanta Hawks’ Trae Young who likes to sink deep treys (make three point shots).

If you don’t watch NBA we’re sorry (but not sorry)… If you are interested in other sports games or teams or similar types of data you can apply part of the code I wrote here to do some fun activities! But for today let’s start with an NBA game of the Atlanta Hawks (RM’s team) vs. the New York Knicks (Maggie x Jason) and have fun with data!!
What we are going to do is check how the scores changed over time in the March 11th, 2020 game that the ATL Hawks (RM) played against the NY Knicks (Jason & Maggie) at State Farm Arena in Atlanta, Georgia.

NOW! We need NBA game data so that we can use R-studio to compare the scores of the two teams as the game progressed.

3. Data preparation + visualization

1) Download the data

Here you can download NBA data for the entire league for all of the COVID-impacted 2019-2020 season: https://sports-statistics.com/database/basketball-data/nba/2019-20_pbp.csv

For the code to follow, I will be using the csv (“2019-20_pbp.csv”) as saved to my “Downloads” folder

Before you bring the data to R-studio let’s open our CSV file and see what’s in there.

Note: this file is large. If you wish, you can first type the following at a command prompt: head 2019-20_pbp.csv > 2019-20_pbp-head.csv and then open 2019-20_pbp-head.csv. Although head is standard on Linux (and Macs), it may not be available on your Windows installation. After this initial look at the file, you’ll want to use the “full” file in all subsequent R analysis.
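
If the command-line head is not available to you, a similar peek can be done from inside R. This is only a sketch: it builds a tiny stand-in CSV in a temporary file so that it runs anywhere, and the name `csv_path` is made up for the example. With the real data you would point `csv_path` at "~/Downloads/2019-20_pbp.csv" instead.

```r
# Stand-in file so the example runs anywhere (toy values, not real game data)
csv_path <- tempfile(fileext = ".csv")
toy <- data.frame(Quarter = c(1, 1, 1, 2, 2, 2),
                  SecLeft = c(720, 708, 707, 700, 684, 678))
write.csv(toy, csv_path, row.names = FALSE)

# Look at the raw text of the first 3 lines (the header plus 2 rows)
first_lines <- readLines(csv_path, n = 3)
first_lines

# Or let R parse only the first 5 data rows instead of the whole file
peek <- read.csv(csv_path, nrows = 5)
peek
```

Both readLines() and read.csv() are part of base R, so this works the same way on Windows, Mac and Linux.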

The first row of the file shows the names of all the columns, and from the second row on we can see the data. When you scroll to the right, you can see that this data shows not only points made but also other records, e.g. fouls, misses, steals and other plays.

2) Load the data to R-studio

Our file is a CSV file, so we will use the “read.csv()” function to bring it into R-studio. If the data is text-based but not comma-separated (e.g. separated by a pipe |, a tab [written \t in this and many other functions] or a space), you can pass sep="|" (or the matching separator) as one of the parameters to read.csv(). If the file is not a text-based file, you may need another file-reading function (e.g. readRDS()).
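
Here is a quick sketch of the sep parameter in action. To keep it self-contained, the “files” are just short made-up strings passed through the text argument of read.csv(). The teams and scores are toy values.

```r
# The same toy table written with two different separators
pipe_text <- "Team|Score\nA|10\nB|12"
tab_text  <- "Team\tScore\nA\t10\nB\t12"

# Tell read.csv() which separator to expect
pipe_df <- read.csv(text = pipe_text, sep = "|")
tab_df  <- read.csv(text = tab_text, sep = "\t")

pipe_df

# Both versions produce the same data.frame
identical(pipe_df, tab_df)
```

With a real file you would give the file path as the first argument instead of text =.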

# Here we use the function "read.csv" to read/load the csv into Rstudio. 
# The path to the file is given as the argument to the function.
# "~" points to your home directory. Let's find out where that is.
path.expand('~')
## [1] "/Users/jasonyoo"
# Writing "~" first means you don't have to add the whole beginning of the path (probably "C:/Users/you/Documents"), just start your instructions to the file from where ~ ends.
NBA_19_20 <- read.csv("~/Downloads/2019-20_pbp.csv")

After you run NBA_19_20 <- read.csv("~/Downloads/2019-20_pbp.csv") you will see a new object called “NBA_19_20” in the Environment tab.

Alternatively, large text-based files can be read in more quickly with data.table’s fread() (see here and here to answer the question of “why bother?”) and then converted, for example with as.data.frame(), to the data.frame data type used elsewhere in this tutorial.

NBA_19_20 <- data.table::fread("~/Downloads/2019-20_pbp.csv")

3) Check the data

We want to see what our data (Mario) looks like: how big it is, the color of its hat… etc.

## Let's see how big our Mario (data) is with a function called dim()
dim(NBA_19_20)
## [1] 539265     41

The result shows the number of rows and columns: there are 539,265 rows and 41 columns. These numbers should match the numbers you see in the “Environment” tab.
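
As a small aside, dim() has a few close relatives. The sketch below uses a tiny made-up data.frame (called `toy` here, not the NBA data) to show them.

```r
# A toy data.frame with 3 rows and 3 columns (made-up values)
toy <- data.frame(Quarter   = c(1, 1, 2),
                  SecLeft   = c(720, 708, 700),
                  HomeScore = c(0, 0, 24))

dim(toy)   # rows first, then columns
nrow(toy)  # just the number of rows
ncol(toy)  # just the number of columns
str(toy)   # a compact overview: one line per column with its type
```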

Let’s check the top six rows of our data (Mario) with the head() function in R. If you want to see the last six rows of our data, you can try tail(NBA_19_20).

## head() function to check the top rows
# Default is six, but if you want to see top 20 rows then you can write like head(NBA_19_20, n=20) 
head(NBA_19_20) 
##                             URL GameType                        Location
##                          <char>   <char>                          <char>
## 1: /boxscores/201910220TOR.html  regular Scotiabank Arena Toronto Canada
## 2: /boxscores/201910220TOR.html  regular Scotiabank Arena Toronto Canada
## 3: /boxscores/201910220TOR.html  regular Scotiabank Arena Toronto Canada
## 4: /boxscores/201910220TOR.html  regular Scotiabank Arena Toronto Canada
## 5: /boxscores/201910220TOR.html  regular Scotiabank Arena Toronto Canada
## 6: /boxscores/201910220TOR.html  regular Scotiabank Arena Toronto Canada
##               Date    Time WinningTeam Quarter SecLeft AwayTeam
##             <char>  <char>      <char>   <int>   <int>   <char>
## 1: October 22 2019 8:00 PM         TOR       1     720      NOP
## 2: October 22 2019 8:00 PM         TOR       1     708      NOP
## 3: October 22 2019 8:00 PM         TOR       1     707      NOP
## 4: October 22 2019 8:00 PM         TOR       1     707      NOP
## 5: October 22 2019 8:00 PM         TOR       1     689      NOP
## 6: October 22 2019 8:00 PM         TOR       1     685      NOP
##                                                        AwayPlay AwayScore
##                                                          <char>     <int>
## 1: Jump ball: D. Favors vs. M. Gasol (L. Ball gains possession)         0
## 2:                     L. Ball misses 2-pt jump shot from 11 ft         0
## 3:                               Offensive rebound by D. Favors         0
## 4:                            D. Favors makes 2-pt layup at rim         2
## 5:                                                                      2
## 6:                               Defensive rebound by J. Redick         2
##    HomeTeam                               HomePlay HomeScore
##      <char>                                 <char>     <int>
## 1:      TOR                                                0
## 2:      TOR                                                0
## 3:      TOR                                                0
## 4:      TOR                                                0
## 5:      TOR O. Anunoby misses 2-pt layup from 3 ft         0
## 6:      TOR                                                0
##                   Shooter       ShotType ShotOutcome ShotDist Assister Blocker
##                    <char>         <char>      <char>    <int>   <char>  <char>
## 1:                                                         NA                 
## 2:     L. Ball - balllo01 2-pt jump shot        miss       11                 
## 3:                                                         NA                 
## 4:  D. Favors - favorde01     2-pt layup        make        0                 
## 5: O. Anunoby - anunoog01     2-pt layup        miss        3                 
## 6:                                                         NA                 
##    FoulType Fouler Fouled             Rebounder ReboundType ViolationPlayer
##      <char> <char> <char>                <char>      <char>          <char>
## 1:                                                                         
## 2:                                                                         
## 3:                        D. Favors - favorde01   offensive                
## 4:                                                                         
## 5:                                                                         
## 6:                        J. Redick - redicjj01   defensive                
##    ViolationType TimeoutTeam FreeThrowShooter FreeThrowOutcome FreeThrowNum
##           <char>      <char>           <char>           <char>       <char>
## 1:                                                                         
## 2:                                                                         
## 3:                                                                         
## 4:                                                                         
## 5:                                                                         
## 6:                                                                         
##    EnterGame LeaveGame TurnoverPlayer TurnoverType TurnoverCause TurnoverCauser
##       <char>    <char>         <char>       <char>        <char>         <char>
## 1:                                                                             
## 2:                                                                             
## 3:                                                                             
## 4:                                                                             
## 5:                                                                             
## 6:                                                                             
##       JumpballAwayPlayer   JumpballHomePlayer       JumpballPoss    V41
##                   <char>               <char>             <char> <lgcl>
## 1: D. Favors - favorde01 M. Gasol - gasolma01 L. Ball - balllo01     NA
## 2:                                                                   NA
## 3:                                                                   NA
## 4:                                                                   NA
## 5:                                                                   NA
## 6:                                                                   NA

Here is a more organized version of the data, which you can compare with the data in your Environment.

# Here we specify that we want to use the function "paged_table" from the package "rmarkdown"
# If you do not have the entire package loaded with a `library()` command (but it was installed in a previous session), you can access just this function
# For example, maybe we want to use Mario's car, but not install all of mariokart, so we might use mariokart::car

rmarkdown::paged_table(NBA_19_20)
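
Here is one more small sketch of the package::function pattern, using the tools package that ships with R. The choice of tools::file_ext() is just for illustration; any installed package works the same way.

```r
# Use one function from the "tools" package without calling library(tools)
ext <- tools::file_ext("2019-20_pbp.csv")
ext
```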

As you can see, this is far too much information for us, so let’s use the column names to see what kinds of information in this data suit our goal.

colnames(NBA_19_20)
##  [1] "URL"                "GameType"           "Location"          
##  [4] "Date"               "Time"               "WinningTeam"       
##  [7] "Quarter"            "SecLeft"            "AwayTeam"          
## [10] "AwayPlay"           "AwayScore"          "HomeTeam"          
## [13] "HomePlay"           "HomeScore"          "Shooter"           
## [16] "ShotType"           "ShotOutcome"        "ShotDist"          
## [19] "Assister"           "Blocker"            "FoulType"          
## [22] "Fouler"             "Fouled"             "Rebounder"         
## [25] "ReboundType"        "ViolationPlayer"    "ViolationType"     
## [28] "TimeoutTeam"        "FreeThrowShooter"   "FreeThrowOutcome"  
## [31] "FreeThrowNum"       "EnterGame"          "LeaveGame"         
## [34] "TurnoverPlayer"     "TurnoverType"       "TurnoverCause"     
## [37] "TurnoverCauser"     "JumpballAwayPlayer" "JumpballHomePlayer"
## [40] "JumpballPoss"       "V41"

We will need a subset of columns to retrieve the specific game we are interested in and to check how scores changed over time and to plot the information.

  1. Location: State Farm Arena Atlanta Georgia

  2. Date: March 11 2020

  3. Quarter: which quarter they are playing

  4. SecLeft: how many seconds left per quarter

  5. AwayTeam: NYK

  6. AwayScore: we will plot this as dependent variable

  7. HomeTeam: ATL

  8. HomeScore: we will plot this as dependent variable

  9. ShotType: for the summary outcome

  10. ShotOutcome: for the summary outcome

  11. HomePlay: for the summary outcome

  12. AwayPlay: for the summary outcome
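
Before selecting, it can be worth checking that every column on this list really exists (a typo in a column name is a very common error). The sketch below is one possible way to do that with base R’s setdiff(). To stay self-contained it uses a hand-typed vector, `available`, holding the first 17 column names from the colnames() output above. With the real data you would use colnames(NBA_19_20) instead. The names `wanted`, `available` and `missing_cols` are made up for this example.

```r
# The 12 columns we want to keep
wanted <- c("Location", "Date", "Quarter", "SecLeft",
            "AwayTeam", "AwayScore", "HomeTeam", "HomeScore",
            "ShotType", "ShotOutcome", "HomePlay", "AwayPlay")

# Stand-in for colnames(NBA_19_20): the first 17 names from the output above
available <- c("URL", "GameType", "Location", "Date", "Time", "WinningTeam",
               "Quarter", "SecLeft", "AwayTeam", "AwayPlay", "AwayScore",
               "HomeTeam", "HomePlay", "HomeScore", "Shooter",
               "ShotType", "ShotOutcome")

# setdiff() returns the wanted names that are NOT available
missing_cols <- setdiff(wanted, available)
missing_cols  # an empty result means every wanted column was found
```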

Let’s learn how to shrink our massive data by selecting only the columns we want!

** You should always select columns based on your purpose!

4) Select the columns

In this case, we are going to select twelve columns: Location, Date, Quarter, SecLeft, AwayTeam, AwayScore, HomeTeam, HomeScore, ShotType, ShotOutcome, HomePlay and AwayPlay. We will do this with the select() function and the pipe (%>%), which is described in the comments of the code.

library(dplyr)
# Let's select 12 columns with select function. 
NBA_19_20_with_12cols <- NBA_19_20 %>% select(Location, Date,
                                        Quarter, SecLeft, 
                                        AwayTeam, AwayScore,
                                        HomeTeam, HomeScore,
                                        ShotType, ShotOutcome,
                                        HomePlay, AwayPlay
                                        )

# Here we have used the pipe "%>%" which was loaded thanks to tidyverse. This weird set of symbols says "use the thing before %>% as the first argument of the following function"
# Whenever you use the pipe, there is another valid way you could have written the function. Let's write it without the pipe as well
NBA_19_20_with_12cols <- select(NBA_19_20, Location, Date,
                                        Quarter, SecLeft, 
                                        AwayTeam, AwayScore,
                                        HomeTeam, HomeScore,
                                        ShotType, ShotOutcome,
                                        HomePlay, AwayPlay)

# So why would you use the pipe or not? I think the first way is easier to read. We know that before the pipe is the dataset we are selecting from, and everything inside the parentheses is the columns we are selecting.
# Readability becomes even more obvious when you use the pipe to chain functions together. For example, we could read in the csv and select all at once with the pipe
NBA_19_20_with_12cols <- read.csv("~/Downloads/2019-20_pbp.csv") %>% 
                        select(Location, Date, Quarter, 
                               SecLeft, AwayTeam, AwayScore, 
                               HomeTeam, HomeScore,
                               ShotType, ShotOutcome,
                               HomePlay, AwayPlay)

# Which can also be written without the pipe
NBA_19_20_with_12cols <- select(read.csv("~/Downloads/2019-20_pbp.csv"), Location, Date,
                               Quarter, SecLeft, AwayTeam,
                               AwayScore, HomeTeam, HomeScore,
                               ShotType, ShotOutcome,
                               HomePlay, AwayPlay)

# So should you chain all your commands together forever? No, probably not.
# For example in this case, if we were to select columns during our read.csv() operation, we never get the chance to view the whole dataset or use dim(), head(), colnames(), etc.
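
If you would like to convince yourself that the piped and un-piped versions really do the same thing, here is a minimal sketch on a tiny made-up data.frame (so it runs in a second instead of re-reading the big CSV). The object names are hypothetical.

```r
library(dplyr)

# A toy data.frame with one column we do not need (made-up values)
toy <- data.frame(Quarter   = c(1, 1),
                  SecLeft   = c(720, 708),
                  HomeScore = c(0, 2),
                  URL       = c("a", "b"))

# The same selection, with and without the pipe
with_pipe    <- toy %>% select(Quarter, SecLeft, HomeScore)
without_pipe <- select(toy, Quarter, SecLeft, HomeScore)
identical(with_pipe, without_pipe)

# select() can also drop a column by putting a minus sign in front of it
no_url <- toy %>% select(-URL)
colnames(no_url)
```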

Pause here,

You can also see two objects in the “Environment” tab at the top right: one is “NBA_19_20” and the other is “NBA_19_20_with_12cols”. The “NBA_19_20” object has 539,265 rows and 41 columns, but “NBA_19_20_with_12cols” has 539,265 rows and only the 12 columns that we selected with the “select()” function. This is one of the ways to check how our object changed before and after our code.

Compare your data to this table.

rmarkdown::paged_table(NBA_19_20_with_12cols)

We can use the object with 539,265 rows and 12 columns, but we still have some room to optimize our process. We can narrow it down to only the game that we are interested in at State Farm Arena. (In this case we only have 539,265 rows, so it’s not a big deal. But what if you had data with 3 million rows? That would significantly slow down data processing, so it’s better to learn how to grab only the rows we need / want.)

5) Filter the rows

5-1) Games from State Farm Arena

In this case, we are going to use a function called “filter()” to keep only the games that happened in “State Farm Arena Atlanta Georgia”. The filter() function requires two components: one is the object that we want to filter and the other is the condition. So in this case we want to filter the “NBA_19_20_with_12cols” object, keeping the rows where the “Location” column is equal to “State Farm Arena Atlanta Georgia”.

NBA_19_20_only_SFA <- filter(NBA_19_20_with_12cols, Location == "State Farm Arena Atlanta Georgia")

# How would you write this using the pipe?
NBA_19_20_only_SFA <- NBA_19_20_with_12cols %>% filter(Location == "State Farm Arena Atlanta Georgia")
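
To see what the condition inside filter() is actually doing, here is a small sketch on made-up data (the arena names and scores are toy values). The == comparison gives one TRUE or FALSE per row, and filter() keeps the rows that are TRUE.

```r
library(dplyr)

# Toy data: three plays in two hypothetical arenas
toy <- data.frame(Location  = c("Arena A", "Arena B", "Arena A"),
                  HomeScore = c(0, 2, 4))

# The condition on its own: one TRUE/FALSE per row
keep <- toy$Location == "Arena A"
keep  # TRUE FALSE TRUE

# filter() keeps only the rows where the condition is TRUE
only_a <- filter(toy, Location == "Arena A")
only_a
```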

Welcome to State Farm Arena

Let’s check our data with the head() function. In addition when you check your “Environment” tab, you can see that your “NBA_19_20_only_SFA” has 16,929 rows and 12 columns.

head(NBA_19_20_only_SFA) 
##                           Location            Date Quarter SecLeft AwayTeam
## 1 State Farm Arena Atlanta Georgia October 26 2019       1     720      ORL
## 2 State Farm Arena Atlanta Georgia October 26 2019       1     707      ORL
## 3 State Farm Arena Atlanta Georgia October 26 2019       1     688      ORL
## 4 State Farm Arena Atlanta Georgia October 26 2019       1     683      ORL
## 5 State Farm Arena Atlanta Georgia October 26 2019       1     683      ORL
## 6 State Farm Arena Atlanta Georgia October 26 2019       1     670      ORL
##   AwayScore HomeTeam HomeScore       ShotType ShotOutcome
## 1         0      ATL         0                           
## 2         0      ATL         2 2-pt jump shot        make
## 3         0      ATL         2     2-pt layup        miss
## 4         0      ATL         2                           
## 5         2      ATL         2     2-pt layup        make
## 6         2      ATL         4 2-pt jump shot        make
##                                   HomePlay
## 1                                         
## 2 T. Young makes 2-pt jump shot from 14 ft
## 3                                         
## 4                                         
## 5                                         
## 6 T. Young makes 2-pt jump shot from 18 ft
##                                                     AwayPlay
## 1 Jump ball: N. Vuevi vs. A. Len (T. Young gains possession)
## 2                                                           
## 3 A. Gordon misses 2-pt layup from 2 ft (block by D. Hunter)
## 4                             Offensive rebound by A. Gordon
## 5                       A. Gordon makes 2-pt layup from 1 ft
## 6
## again head is going to show you the first six rows from the "NBA_19_20_only_SFA" data. 

Let’s learn another function, unique(), that is quite useful for checking that we only included the NBA games from State Farm Arena. We are going to check with the print() and unique() functions. The print() function will give you the output in the Console tab. The unique() function returns the unique values from our input. So the following code prints the unique values from the “Location” column of the “NBA_19_20_only_SFA” data.

print(unique(NBA_19_20_only_SFA$Location))
## [1] "State Farm Arena Atlanta Georgia"
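
A short sketch of unique() and two related base R helpers, on a made-up vector of dates (the name `dates` is only for this example). length() counts how many distinct values there are and table() counts how many times each one appears.

```r
# A toy vector with a repeated value
dates <- c("March 9 2020", "March 9 2020", "March 11 2020")

distinct_dates <- unique(dates)
distinct_dates          # each value once, in order of first appearance
length(distinct_dates)  # how many different values there are

date_counts <- table(dates)
date_counts             # how many rows each value has
```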

5-2) Games on March 11th 2020

Second, keep only the game that happened on “March 11 2020” and check our data with the head() and unique() functions.

print(unique(NBA_19_20_only_SFA$Date))
##  [1] "October 26 2019"  "October 28 2019"  "October 31 2019"  "November 5 2019" 
##  [5] "November 6 2019"  "November 8 2019"  "November 20 2019" "November 23 2019"
##  [9] "November 25 2019" "December 2 2019"  "December 4 2019"  "December 13 2019"
## [13] "December 15 2019" "December 19 2019" "December 27 2019" "January 4 2020"  
## [17] "January 6 2020"   "January 8 2020"   "January 14 2020"  "January 18 2020" 
## [21] "January 20 2020"  "January 22 2020"  "January 26 2020"  "January 30 2020" 
## [25] "February 3 2020"  "February 9 2020"  "February 20 2020" "February 22 2020"
## [29] "February 26 2020" "February 28 2020" "February 29 2020" "March 2 2020"    
## [33] "March 9 2020"     "March 11 2020"
NBA_19_20_SAF_March_11 <- filter(NBA_19_20_only_SFA, Date == "March 11 2020") 
head(NBA_19_20_SAF_March_11)
##                           Location          Date Quarter SecLeft AwayTeam
## 1 State Farm Arena Atlanta Georgia March 11 2020       1     716      NYK
## 2 State Farm Arena Atlanta Georgia March 11 2020       1     708      NYK
## 3 State Farm Arena Atlanta Georgia March 11 2020       1     705      NYK
## 4 State Farm Arena Atlanta Georgia March 11 2020       1     699      NYK
## 5 State Farm Arena Atlanta Georgia March 11 2020       1     696      NYK
## 6 State Farm Arena Atlanta Georgia March 11 2020       1     690      NYK
##   AwayScore HomeTeam HomeScore       ShotType ShotOutcome
## 1         0      ATL         0                           
## 2         0      ATL         0     2-pt layup        miss
## 3         0      ATL         0                           
## 4         0      ATL         0 3-pt jump shot        miss
## 5         0      ATL         0                           
## 6         0      ATL         0 2-pt jump shot        miss
##                                     HomePlay
## 1                                           
## 2                                           
## 3             Defensive rebound by D. Dedmon
## 4 D. Dedmon misses 3-pt jump shot from 25 ft
## 5                  Offensive rebound by Team
## 6  T. Young misses 2-pt jump shot from 16 ft
##                                  AwayPlay
## 1                                        
## 2 M. Harkless misses 2-pt layup from 1 ft
## 3                                        
## 4                                        
## 5                                        
## 6
print(unique(NBA_19_20_SAF_March_11$Date))
## [1] "March 11 2020"

As you can see by looking at your “Environment” tab, your data (“NBA_19_20_SAF_March_11”) now has 533 rows and 12 columns. (filter() only removes rows; it does not drop the columns you filter on.)
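
Two things about filter() can be checked quickly on toy data: it keeps every column (only rows are removed), and it accepts several conditions at once, so the two filtering steps could also be written as one. This is a sketch with made-up arenas, dates and scores, not the tutorial’s actual objects.

```r
library(dplyr)

# Toy data: three rows, two hypothetical arenas, two dates
toy <- data.frame(Location  = c("Arena A", "Arena A", "Arena B"),
                  Date      = c("March 9 2020", "March 11 2020", "March 11 2020"),
                  HomeScore = c(10, 20, 30))

# Conditions separated by commas must ALL be true for a row to be kept
one_game <- toy %>% filter(Location == "Arena A", Date == "March 11 2020")
one_game

dim(toy)       # 3 rows, 3 columns
dim(one_game)  # 1 row, still 3 columns
```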

6) Add new column

Our goal is to make a figure that shows the scores for each team changing over time. So we need two related sets of data (home and away scores), with time – represented by SecLeft – as the “independent variable” (x-axis) for both. AwayScore and HomeScore will be the dependent variables (y-axis). This is what our final figure will look like!

6-1) Visualize what we have

First, let’s see how ATL Hawks score changes over time with HomeScore and SecLeft columns. We are going to use functions ggplot() and geom_line(). Using ggplot is like building a castle with Legos. You can start with making a “wall of simple blocks” (a plot with default parameters), and keep customizing your castle’s complexity by adding “a moat and bridge” (trend lines) or “turrets” (different symbols instead of dots). You can even select your water color for your moat, decide that you’d like several sections to your bridge (dashed lines) and pick brick color(s) for your turrets.

We start each ggplot figure with the command ggplot(). This is the foundational block that we will add other blocks onto. First we need to tell ggplot where to look for the data, then we can tell it how to display the data. Anything we specify in this foundational block will be used in later blocks as well.

The ggplot() function uses aes(), whose documentation says: “Aesthetic mappings describe how variables in the data are mapped to visual properties (aesthetics) of geoms”. In short, this is the part where you tell ggplot that you want “SecLeft” on the x-axis and “HomeScore” on the y-axis.

library(ggplot2)
# We tell ggplot that our data is "NBA_19_20_SAF_March_11", that "SecLeft" goes on the x axis and "HomeScore" on the y axis.
# Then we ADD geom_line() to our foundational block. geom_line() allows us to make... line graphs!
HAWKS_plot <- ggplot(NBA_19_20_SAF_March_11, aes(x = SecLeft, y = HomeScore)) +
                              geom_line() 
## Here is the same plot with two aes() calls; aesthetic mappings can be set in ggplot() and in individual layers.
# This time only SecLeft (x axis) is set in ggplot(), so we tell geom_line() that HomeScore will be the y axis.
HAWKS_plot_2 <- ggplot(NBA_19_20_SAF_March_11, aes(x = SecLeft)) +
                              geom_line(aes(y = HomeScore)) 
# Let's display the plots
HAWKS_plot

HAWKS_plot_2
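
Why would you ever put aes() inside a layer instead of in ggplot()? One reason is that different layers can then use different columns. The sketch below uses a tiny made-up data.frame (toy scores, not the real game) and draws two lines that share the x axis from ggplot() but each bring their own y column. The object names `toy` and `p` are hypothetical.

```r
library(ggplot2)

# Toy data: made-up scores for one quarter
toy <- data.frame(SecLeft   = c(720, 600, 400, 200, 0),
                  HomeScore = c(0, 4, 9, 15, 24),
                  AwayScore = c(0, 5, 8, 17, 22))

# x is set once in ggplot() and inherited by both layers;
# each geom_line() supplies its own y
p <- ggplot(toy, aes(x = SecLeft)) +
  geom_line(aes(y = HomeScore)) +
  geom_line(aes(y = AwayScore), linetype = "dashed") +
  labs(y = "Score")

p
```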

6-2) Understand what we have

As you can see from the “HAWKS_plot” results above, this is not what we wanted to see. There are some patterns, but the score does not steadily increase as SecLeft decreases. Let’s check the data we plotted here, but we’re going to cheat a bit by letting you know that there is another column of data that will help explain why the lines are hopping all over the place. SecLeft isn’t simply the time left in the game. Instead it is the time left in the current quarter of play, with each quarter being 12 minutes (720 seconds) long. So, let us grab the quarter of the game, which will generally be a number from 1-4*, in addition to the other information that we will plot. We will do this by going back to our raw data for this game and retrieving Quarter and SecLeft along with the score for the home team.

* For a game that wasn’t tied (both teams having an equal score) at the end of the 4th quarter.

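
As an aside, the per-quarter clock described above can be turned into a single “seconds since tip-off” value with a little arithmetic. This is only a sketch under the assumption that every quarter is a regulation 12-minute (720-second) quarter, so it does not handle overtime. The column name SecElapsed and the toy values are made up for illustration.

```r
library(dplyr)

# Toy rows: start of Q1, end of Q1, early Q2, end of Q4 (made-up scores)
toy <- data.frame(Quarter   = c(1, 1, 2, 4),
                  SecLeft   = c(720, 0, 700, 0),
                  HomeScore = c(0, 24, 24, 100))

quarter_length <- 720  # 12 minutes; assumes regulation quarters only

# Seconds in all finished quarters + seconds already played in this quarter
toy <- toy %>%
  mutate(SecElapsed = (Quarter - 1) * quarter_length +
                      (quarter_length - SecLeft))

toy$SecElapsed  # 0 720 740 2880
```
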
NBA_19_20_SAF_March_11_time_n_score <- NBA_19_20_SAF_March_11 %>% select(Quarter, SecLeft, HomeScore)
NBA_19_20_SAF_March_11_time_n_score
##     Quarter SecLeft HomeScore
## 1         1     716         0
## 2         1     708         0
## 3         1     705         0
## 4         1     699         0
## 5         1     696         0
## 6         1     690         0
## 7         1     686         0
## 8         1     681         0
## 9         1     679         0
## 10        1     676         0
## 11        1     676         0
## 12        1     656         0
## 13        1     652         0
## 14        1     645         0
## 15        1     640         0
## 16        1     637         0
## 17        1     620         0
## 18        1     610         2
## 19        1     593         2
## 20        1     580         2
## 21        1     576         2
## 22        1     566         2
## 23        1     553         4
## 24        1     545         4
## 25        1     540         4
## 26        1     536         4
## 27        1     532         4
## 28        1     525         4
## 29        1     504         4
## 30        1     500         4
## 31        1     499         4
## 32        1     499         4
## 33        1     499         4
## 34        1     487         4
## 35        1     484         4
## 36        1     484         4
## 37        1     483         4
## 38        1     476         4
## 39        1     463         6
## 40        1     439         6
## 41        1     437         6
## 42        1     429         6
## 43        1     426         6
## 44        1     426         6
## 45        1     426         6
## 46        1     406         6
## 47        1     402         6
## 48        1     399         6
## 49        1     397         6
## 50        1     388         6
## 51        1     384         6
## 52        1     375         9
## 53        1     362         9
## 54        1     360         9
## 55        1     355         9
## 56        1     350         9
## 57        1     346         9
## 58        1     346         9
## 59        1     346         9
## 60        1     346         9
## 61        1     334         9
## 62        1     331         9
## 63        1     325         9
## 64        1     312         9
## 65        1     308         9
## 66        1     308         9
## 67        1     308         9
## 68        1     305        11
## 69        1     299        11
## 70        1     299        11
## 71        1     299        11
## 72        1     299        11
## 73        1     299        11
## 74        1     288        11
## 75        1     284        11
## 76        1     275        11
## 77        1     273        11
## 78        1     267        11
## 79        1     262        11
## 80        1     261        11
## 81        1     261        11
## 82        1     261        11
## 83        1     240        11
## 84        1     236        11
## 85        1     227        11
## 86        1     227        11
## 87        1     227        11
## 88        1     227        11
## 89        1     218        11
## 90        1     214        11
## 91        1     210        11
## 92        1     208        11
## 93        1     198        11
## 94        1     184        14
## 95        1     169        14
## 96        1     151        14
## 97        1     151        15
## 98        1     151        15
## 99        1     151        15
## 100       1     151        16
## 101       1     138        16
## 102       1     134        16
## 103       1     128        19
## 104       1     128        19
## 105       1     128        19
## 106       1     107        19
## 107       1      87        19
## 108       1      80        19
## 109       1      67        19
## 110       1      63        19
## 111       1      53        21
## 112       1      53        21
## 113       1      53        22
## 114       1      33        22
## 115       1      31        22
## 116       1      31        22
## 117       1      31        22
## 118       1      31        22
## 119       1      24        24
## 120       1      24        24
## 121       1      24        24
## 122       1      22        24
## 123       1       4        24
## 124       1       4        24
## 125       1       0        24
## 126       2     700        24
## 127       2     684        24
## 128       2     678        24
## 129       2     678        24
## 130       2     678        24
## 131       2     662        26
## 132       2     649        26
## 133       2     642        26
## 134       2     624        26
## 135       2     619        26
## 136       2     614        26
## 137       2     611        26
## 138       2     595        26
## 139       2     592        26
## 140       2     584        26
## 141       2     581        26
## 142       2     566        26
## 143       2     562        26
## 144       2     562        26
## 145       2     562        26
## 146       2     562        26
## 147       2     562        26
## 148       2     562        26
## 149       2     546        26
## 150       2     543        26
## 151       2     540        26
## 152       2     535        26
## 153       2     529        26
## 154       2     525        26
## 155       2     500        28
## 156       2     488        28
## 157       2     484        28
## 158       2     478        28
## 159       2     476        28
## 160       2     469        28
## 161       2     464        28
## 162       2     450        28
## 163       2     446        28
## 164       2     433        28
## 165       2     428        28
## 166       2     424        30
## 167       2     423        30
## 168       2     423        30
## 169       2     423        30
## 170       2     423        30
## 171       2     423        30
## 172       2     413        30
## 173       2     398        30
## 174       2     394        30
## 175       2     389        30
## 176       2     364        30
## 177       2     364        30
## 178       2     364        30
## 179       2     341        30
## 180       2     337        32
## 181       2     322        32
## 182       2     303        32
## 183       2     300        32
## 184       2     286        32
## 185       2     275        34
## 186       2     260        34
## 187       2     248        36
## 188       2     248        36
## 189       2     248        36
## 190       2     248        36
## 191       2     245        36
## 192       2     239        36
## 193       2     239        36
## 194       2     239        36
## 195       2     239        36
## 196       2     239        36
## 197       2     235        36
## 198       2     235        36
## 199       2     232        39
## 200       2     209        39
## 201       2     189        39
## 202       2     186        39
## 203       2     185        39
## 204       2     185        39
## 205       2     177        39
## 206       2     172        39
## 207       2     172        39
## 208       2     162        39
## 209       2     160        39
## 210       2     150        39
## 211       2     146        39
## 212       2     141        39
## 213       2     141        39
## 214       2     141        39
## 215       2     141        39
## 216       2     129        42
## 217       2     110        42
## 218       2     100        45
## 219       2      85        45
## 220       2      67        45
## 221       2      64        45
## 222       2      48        45
## 223       2      34        45
## 224       2      32        45
## 225       2      27        47
## 226       2      27        47
## 227       2      27        47
## 228       2      27        47
## 229       2      27        48
## 230       2       9        48
## 231       2       4        48
## 232       2       2        50
## 233       2       0        50
## 234       2       0        50
## 235       2       0        50
## 236       3     705        50
## 237       3     702        50
## 238       3     693        50
## 239       3     689        50
## 240       3     687        52
## 241       3     675        52
## 242       3     671        52
## 243       3     666        52
## 244       3     662        52
## 245       3     658        52
## 246       3     644        52
## 247       3     624        54
## 248       3     621        54
## 249       3     596        54
## 250       3     586        57
## 251       3     571        57
## 252       3     567        57
## 253       3     563        57
## 254       3     563        57
## 255       3     563        57
## 256       3     563        58
## 257       3     550        58
## 258       3     545        58
## 259       3     540        58
## 260       3     540        58
## 261       3     540        58
## 262       3     540        58
## 263       3     540        58
## 264       3     525        58
## 265       3     523        58
## 266       3     514        58
## 267       3     500        58
## 268       3     497        58
## 269       3     489        58
## 270       3     470        58
## 271       3     470        59
## 272       3     470        59
## 273       3     470        60
## 274       3     451        60
## 275       3     436        60
## 276       3     434        60
## 277       3     430        60
## 278       3     430        60
## 279       3     430        60
## 280       3     430        60
## 281       3     430        60
## 282       3     430        60
## 283       3     421        62
## 284       3     401        62
## 285       3     397        62
## 286       3     393        62
## 287       3     391        62
## 288       3     388        62
## 289       3     385        62
## 290       3     380        62
## 291       3     364        64
## 292       3     364        64
## 293       3     364        64
## 294       3     364        65
## 295       3     345        65
## 296       3     341        65
## 297       3     336        65
## 298       3     334        65
## 299       3     334        65
## 300       3     334        65
## 301       3     313        67
## 302       3     298        67
## 303       3     289        67
## 304       3     286        67
## 305       3     267        67
## 306       3     255        69
## 307       3     244        69
## 308       3     229        69
## 309       3     228        69
## 310       3     228        69
## 311       3     217        71
## 312       3     202        71
## 313       3     202        71
## 314       3     202        71
## 315       3     202        71
## 316       3     189        71
## 317       3     169        71
## 318       3     160        71
## 319       3     153        71
## 320       3     153        73
## 321       3     143        73
## 322       3     143        73
## 323       3     143        73
## 324       3     130        73
## 325       3     126        73
## 326       3     121        73
## 327       3     102        73
## 328       3     102        73
## 329       3     102        73
## 330       3     102        73
## 331       3     102        73
## 332       3     102        73
## 333       3     102        73
## 334       3     102        73
## 335       3     102        73
## 336       3      94        73
## 337       3      76        73
## 338       3      71        73
## 339       3      69        73
## 340       3      67        73
## 341       3      56        73
## 342       3      53        73
## 343       3      39        73
## 344       3      34        76
## 345       3      25        76
## 346       3      22        76
## 347       3       4        76
## 348       3       1        76
## 349       3       1        78
## 350       3       0        78
## 351       4     704        80
## 352       4     687        80
## 353       4     685        80
## 354       4     679        80
## 355       4     672        80
## 356       4     669        80
## 357       4     669        80
## 358       4     669        80
## 359       4     669        80
## 360       4     669        80
## 361       4     661        82
## 362       4     661        82
## 363       4     661        83
## 364       4     645        83
## 365       4     642        83
## 366       4     642        83
## 367       4     632        83
## 368       4     613        83
## 369       4     598        83
## 370       4     596        83
## 371       4     588        83
## 372       4     584        83
## 373       4     568        83
## 374       4     563        83
## 375       4     554        83
## 376       4     552        85
## 377       4     552        85
## 378       4     552        85
## 379       4     549        85
## 380       4     539        85
## 381       4     531        85
## 382       4     531        86
## 383       4     531        86
## 384       4     531        87
## 385       4     516        87
## 386       4     512        87
## 387       4     506        90
## 388       4     505        90
## 389       4     505        90
## 390       4     505        90
## 391       4     486        90
## 392       4     484        90
## 393       4     473        90
## 394       4     473        91
## 395       4     468        91
## 396       4     455        91
## 397       4     455        91
## 398       4     455        91
## 399       4     449        91
## 400       4     445        91
## 401       4     431        91
## 402       4     412        91
## 403       4     404        94
## 404       4     382        94
## 405       4     382        94
## 406       4     382        94
## 407       4     382        94
## 408       4     380        94
## 409       4     369        94
## 410       4     365        94
## 411       4     361        94
## 412       4     348        96
## 413       4     324        96
## 414       4     324        96
## 415       4     317        96
## 416       4     299        96
## 417       4     297        96
## 418       4     297        96
## 419       4     290        99
## 420       4     273        99
## 421       4     273        99
## 422       4     273        99
## 423       4     273        99
## 424       4     270        99
## 425       4     267        99
## 426       4     267       100
## 427       4     267       101
## 428       4     261       101
## 429       4     261       101
## 430       4     261       101
## 431       4     261       101
## 432       4     258       101
## 433       4     253       103
## 434       4     253       103
## 435       4     253       103
## 436       4     240       103
## 437       4     227       103
## 438       4     204       103
## 439       4     201       103
## 440       4     188       103
## 441       4     184       103
## 442       4     168       103
## 443       4     156       103
## 444       4     153       103
## 445       4     148       103
## 446       4     148       104
## 447       4     148       105
## 448       4     139       105
## 449       4     139       105
## 450       4     131       107
## 451       4     131       107
## 452       4     116       107
## 453       4     116       107
## 454       4     116       107
## 455       4     104       110
## 456       4      87       110
## 457       4      83       110
## 458       4      76       110
## 459       4      76       111
## 460       4      76       111
## 461       4      73       111
## 462       4      72       113
## 463       4      55       113
## 464       4      47       116
## 465       4      23       116
## 466       4      19       116
## 467       4      16       118
## 468       4      15       118
## 469       4      15       118
## 470       4       4       118
## 471       4       1       118
## 472       4       0       118
## 473       5     300       118
## 474       5     283       118
## 475       5     269       118
## 476       5     265       118
## 477       5     257       118
## 478       5     252       118
## 479       5     248       120
## 480       5     228       120
## 481       5     207       120
## 482       5     205       120
## 483       5     203       122
## 484       5     187       122
## 485       5     173       122
## 486       5     170       122
## 487       5     149       122
## 488       5     146       122
## 489       5     142       122
## 490       5     142       123
## 491       5     142       123
## 492       5     142       124
## 493       5     118       124
## 494       5     110       124
## 495       5     106       124
## 496       5      84       124
## 497       5      81       124
## 498       5      75       124
## 499       5      73       124
## 500       5      73       124
## 501       5      73       124
## 502       5      64       126
## 503       5      55       126
## 504       5      50       126
## 505       5      46       126
## 506       5      44       126
## 507       5      44       126
## 508       5      43       126
## 509       5      43       126
## 510       5      43       126
## 511       5      33       126
## 512       5      26       126
## 513       5      26       126
## 514       5      26       126
## 515       5      24       126
## 516       5      24       126
## 517       5      24       126
## 518       5      20       126
## 519       5      20       127
## 520       5      20       127
## 521       5      20       128
## 522       5      20       128
## 523       5      20       128
## 524       5      19       128
## 525       5      19       128
## 526       5      19       128
## 527       5      19       128
## 528       5      19       128
## 529       5      19       128
## 530       5      19       128
## 531       5      13       131
## 532       5       0       131
## 533       5       0       131

This shows the dataframe in the same row order as the original CSV, trimmed down to only these three columns.

But to draw the earlier plot of HomeScore against SecLeft, geom_line() connected the points in order of the x variable. In effect, the data was re-sorted so that SecLeft ran in “ascending order” (i.e. from the smallest value of 0 to the largest of 720) before the line was drawn. This is akin to doing the following:

NBA_19_20_SAF_March_11_time_n_score[order(NBA_19_20_SAF_March_11_time_n_score$SecLeft),] %>% head(n=10)
##     Quarter SecLeft HomeScore
## 125       1       0        24
## 233       2       0        50
## 234       2       0        50
## 235       2       0        50
## 350       3       0        78
## 472       4       0       118
## 532       5       0       131
## 533       5       0       131
## 348       3       1        76
## 349       3       1        78

If you block out the first column, you’ll notice that there are several different scores at the end of each quarter (0 SecLeft), because sorting by SecLeft mixes all of the quarters together. A point is not scored every second of the game, so there are gaps in the data for seconds in which the score did not change and there were no other plays of note (fouls, steals, etc). This is why the scores of the “HAWKS_plot” oscillate “randomly” from 0 SecLeft to 720: there may have been a score change with 3 seconds left in the 3rd quarter, but the 2nd quarter may not have had any score changes until 8 seconds left. Also, you’ll note that the scores appeared to be generally decreasing over time, but this is because SecLeft is sort of the opposite of what we want. SecLeft counts down within each quarter, while chronologically we want a clock that counts up over the whole game.
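To see the zig-zag on a small scale, here is a minimal sketch with made-up numbers. The `toy` dataframe is hypothetical (it is not the real game log); it only has two quarters with three rows each.

```r
# Toy data (made-up numbers): two quarters, with the clock counting down in each
toy <- data.frame(
  Quarter   = c(1, 1, 1, 2, 2, 2),
  SecLeft   = c(700, 300, 0, 650, 300, 0),
  HomeScore = c(2, 10, 24, 26, 40, 50)
)

# Sorting by SecLeft alone interleaves rows from both quarters
sorted <- toy[order(toy$SecLeft), ]

# The scores now jump up and down instead of rising steadily
sorted$HomeScore
```

Reading the sorted scores from left to right, a low first-quarter score sits right next to a high second-quarter score, which is the same “random” oscillation we saw in the plot.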

6-3) Think about how

We want the plot to be drawn chronologically, as the game progressed over time. So we are going to create a new column called “PlayTime” and start the game off at PlayTime=0. This is just one way to solve this problem, and you will have to figure out a solution that works for your data in the future!

Note that this step might be the most time-consuming part for you in the future, but REMEMBER there is no “THE ANSWER” but rather several ways to get to the finish line. Just as Mario can take shortcuts through pipes to make it to the end of the level faster, maybe you’ll think of a more efficient way to do this than what we present here.

So, how will we calculate this new variable PlayTime? If we didn’t know that each NBA quarter was 12 minutes long, we could look for the largest value of SecLeft overall or in each quarter of a given game and subtract the SecLeft of a given datapoint from that maximum value.

So, let us pull out the maximum SecLeft of our first quarter of play.

NBA_19_20_SAF_March_11_time_n_score[NBA_19_20_SAF_March_11_time_n_score$Quarter==1,] %>% select(SecLeft) %>% max()
## [1] 716

So we see that for Quarter 1 the maximum SecLeft is 716. To calculate our PlayTime column, we might think conceptually like this:

PlayTime for Q1 = max(SecLeft in Q1) - SecLeft

However, to calculate the overall PlayTime in the game during the second quarter, we also have to add the time already played in the first quarter:

PlayTime for Q2 = max(SecLeft in Q1) + max(SecLeft in Q2) - SecLeft
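If you want to check the arithmetic by hand first, here is a small sketch. The helper functions `playtime_q1()` and `playtime_q2()` are hypothetical names made up for this example. The value 716 comes from the max() call above, and 700 is read off the first Quarter 2 row in the printed output.

```r
q1_max <- 716  # maximum SecLeft in Quarter 1 (from the max() call above)
q2_max <- 700  # maximum SecLeft in Quarter 2 (read off the printed rows)

# Hypothetical helpers that mirror the two conceptual formulas
playtime_q1 <- function(sec_left) q1_max - sec_left
playtime_q2 <- function(sec_left) q1_max + q2_max - sec_left

playtime_q1(716)  # first recorded play of Q1
playtime_q1(0)    # end of Q1
playtime_q2(700)  # first recorded play of Q2
playtime_q2(0)    # end of Q2
```

The end of Q1 and the first recorded play of Q2 land on the same PlayTime, so the second quarter picks up exactly where the first one stopped.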

You may wish to compute and store, for each quarter, the cumulative maximum that we will subtract SecLeft from, using the max() and filter() functions.

Q1_time <- max(filter(NBA_19_20_SAF_March_11, Quarter == 1)$SecLeft)
Q2_time <- Q1_time + max(filter(NBA_19_20_SAF_March_11, Quarter == 2)$SecLeft)
Q3_time <- Q2_time + max(filter(NBA_19_20_SAF_March_11, Quarter == 3)$SecLeft)
Q4_time <- Q3_time + max(filter(NBA_19_20_SAF_March_11, Quarter == 4)$SecLeft)

They played overtime, so we have to compute Q5_time too!

Q5_time <- Q4_time + max(filter(NBA_19_20_SAF_March_11, Quarter == 5)$SecLeft)
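Here is one possible shortcut pipe, sketched with base R only. Instead of writing one line per quarter, tapply() finds the maximum SecLeft in every quarter and cumsum() adds them up as it goes. The `toy` dataframe and the names `quarter_max` and `quarter_end` are assumptions for this example; the maxima are the ones visible in the printed output above.

```r
# Toy stand-in: only the first and last row of each quarter
toy <- data.frame(
  Quarter = c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5),
  SecLeft = c(716, 0, 700, 0, 705, 0, 704, 0, 300, 0)
)

# Maximum SecLeft within each quarter
quarter_max <- tapply(toy$SecLeft, toy$Quarter, max)

# Running total: element k plays the role of Qk_time
quarter_end <- cumsum(quarter_max)
quarter_end
```

This also keeps working if a game has a different number of overtime periods, since nothing here is written out quarter by quarter.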

Thankfully they didn’t go into triple-overtime!

6-4) Add new column with conditions.

NBA_19_20_SAF_March_11$PlayTime <- with(NBA_19_20_SAF_March_11,
                                        ## inside with() we can use the column names directly
                                        ## overtime: subtract from the total seconds
                                        ifelse(Quarter == 5, Q5_time - SecLeft,
                                            ifelse(Quarter == 4, Q4_time - SecLeft,
                                              ifelse(Quarter == 3, Q3_time - SecLeft,
                                                ifelse(Quarter == 2, Q2_time - SecLeft,
                                                       Q1_time - SecLeft ) ) ) )
                                        )
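The nested ifelse() calls work, but they get harder to read with every extra quarter. As a sketch of another route, you could put the cumulative times in one vector and use the quarter number as the index. This assumes Quarter only holds the whole numbers 1 to 5; `quarter_end` and `toy` are hypothetical names, and the values are the cumulative maxima visible in the printed output.

```r
# Cumulative end time of each quarter; position k plays the role of Qk_time
quarter_end <- c(716, 1416, 2121, 2825, 3125)

# Toy rows: start of Q1, end of Q2, start of overtime
toy <- data.frame(Quarter = c(1, 2, 5), SecLeft = c(716, 0, 300))

# Look up each row's quarter end by position, then subtract the seconds left
toy$PlayTime <- quarter_end[toy$Quarter] - toy$SecLeft
toy$PlayTime
```

Both approaches should give the same column; the lookup version simply swaps the chain of conditions for a single indexing step.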

Some of the rows are related to foul calls and other plays, so we will then filter the data down to unique rows with the pipe operator (%>%) and distinct().

## 533 rows now became 416 rows. 
NBA_19_20_SAF_March_11 <- NBA_19_20_SAF_March_11 %>% distinct()
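To get a feel for what distinct() does, here is a tiny sketch. The `toy` tibble is made up for this example, although its values mirror a few of the repeated rows in the printed output above. Called with no column names, distinct() compares every column and keeps the first copy of each repeated row.

```r
library(dplyr)

# Three identical rows followed by one different row
toy <- tibble(
  Quarter   = c(1, 1, 1, 1),
  SecLeft   = c(426, 426, 426, 406),
  HomeScore = c(6, 6, 6, 6)
)

# Rows that match in every column collapse into one
deduped <- toy %>% distinct()
nrow(deduped)
```

Keep in mind that two rows only count as duplicates if all of their columns match, so the result depends on which columns your dataframe still has at this point.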

Let’s reuse the plotting code from section 6-1) with our new time column.

HAWKS_plot <- ggplot(NBA_19_20_SAF_March_11, aes(PlayTime)) +   
                              geom_line(aes(y = HomeScore))
HAWKS_plot

Looks great to me :) Now the Hawks’ score is increasing over time!!!

Let’s add more things to our new plot like Legos. First, let’s add AwayScore to our plot.

HAWKS_KNICKS_plot <- ggplot(NBA_19_20_SAF_March_11, aes(x = PlayTime)) +   
                              geom_line(aes(y = HomeScore)) +
                              geom_line(aes(y = AwayScore)) # We add another line to ggplot. Remember x is already PlayTime, but for this line, AwayScore will be the y value rather than HomeScore

HAWKS_KNICKS_plot

7) Detail is Everything

Our “HAWKS_KNICKS_plot” looks promising, but we can enhance it. If a plot contains valuable information but is only understandable to you, it needs improvement. Let’s enhance readability by adding colors to our lines, increasing their thickness, and incorporating vertical lines to mark the end of each quarter.

NBA_19_20_SAF_March_11$PlayTime <- as.numeric(NBA_19_20_SAF_March_11$PlayTime)

HAWKS_vs_KNICKS_plot <- ggplot(NBA_19_20_SAF_March_11, aes(PlayTime)) +   
                              geom_line(aes(y = AwayScore), color = "orange", size = 2.5) + 
                              geom_line(aes(y = HomeScore), color = "red", size = 2.5) +
                              ## Add quarter lines
                              geom_vline(xintercept = Q1_time, colour="black", linetype = "longdash") +
                              geom_vline(xintercept = Q2_time, colour="black", linetype = "longdash") +
                              geom_vline(xintercept = Q3_time, colour="black", linetype = "longdash") +
                              geom_vline(xintercept = Q4_time, colour="black", linetype = "longdash") +
                              geom_vline(xintercept = Q5_time, colour="black", linetype = "longdash")
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
HAWKS_vs_KNICKS_plot
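The warning above is ggplot2 telling us that `size` for lines has been renamed. Here is a sketch of the same kind of plot using `linewidth` instead, which needs ggplot2 3.4.0 or newer. It also passes all the quarter times to a single geom_vline() as one vector. The `toy` dataframe (made-up scores) and the name `quarter_ends` are assumptions for this example.

```r
library(ggplot2)

# Toy stand-in for the game data (made-up scores)
toy <- data.frame(
  PlayTime  = c(0, 716, 1416, 2121),
  HomeScore = c(0, 24, 50, 78),
  AwayScore = c(0, 30, 55, 80)
)

# Hypothetical vector playing the role of Q1_time, Q2_time, ...
quarter_ends <- c(716, 1416, 2121)

toy_plot <- ggplot(toy, aes(PlayTime)) +
  geom_line(aes(y = AwayScore), color = "orange", linewidth = 2.5) +  # linewidth replaces size
  geom_line(aes(y = HomeScore), color = "red", linewidth = 2.5) +
  # one geom_vline() call draws a dashed line at every value in the vector
  geom_vline(xintercept = quarter_ends, colour = "black", linetype = "longdash")

toy_plot
```

With the real data you could try `c(Q1_time, Q2_time, Q3_time, Q4_time, Q5_time)` in place of `quarter_ends`.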

8) Final product with animation

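## geom = "table" and ttheme_gtlight come from the ggpp package (attached by library(ggpmisc));
## transition_reveal() and view_follow() come from gganimate, so load both packages first
## make a score board for the game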
score_df <- tibble(NYK = max(NBA_19_20_SAF_March_11$AwayScore),
                    ATL = max(NBA_19_20_SAF_March_11$HomeScore))

HAWKS_vs_KNICKS_final_plot <- ggplot(NBA_19_20_SAF_March_11, aes(PlayTime)) +   
                              geom_line(aes(y = AwayScore, group = AwayTeam),  color = "orange", size = 2.5) + 
                              geom_line(aes(y = HomeScore, group = HomeTeam), color = "red", size = 2.5) + 
                              ## Add quarter lines
                              geom_vline(xintercept = Q1_time, colour="black", linetype = "longdash") +
                              geom_vline(xintercept = Q2_time, colour="black", linetype = "longdash") +
                              geom_vline(xintercept = Q3_time, colour="black", linetype = "longdash") +
                              geom_vline(xintercept = Q4_time, colour="black", linetype = "longdash") +
                              geom_vline(xintercept = Q5_time, colour="black", linetype = "longdash") +
                              labs(y= "Scores", x = "Time (seconds)") +
                              # Add table to the figure
                              annotate(geom = "table", x = 20, y = 140,
                                      label = list(score_df),
                                       vjust = 1, hjust = 0, size = 9,
                                       table.theme = ttheme_gtlight
                                       )


HAWKS_vs_KNICKS_final_plot_animated <- HAWKS_vs_KNICKS_final_plot +
                                  transition_reveal(PlayTime) +
                                  view_follow(fixed_y = TRUE) 

Based on the figure below, we can see that the New York Knicks were in the lead for nearly the entire game. The Hawks caught up at the end of the 4th quarter, forcing overtime, but in the end they lost the game at their home stadium ;///. – Sorry Richard.

HAWKS_vs_KNICKS_final_plot_animated

HAWKS_vs_KNICKS_final_plot

9) Let’s save our game

Imagine reaching level 10 of Mario, only to lose all progress! To prevent this, you’d save your game. On a Nintendo Switch you can save a “snapshot.” Similarly, let’s save our DataFrame, “NBA_19_20_SAF_March_11,” to avoid starting over. We’ll save it in your “Downloads” folder as “NBA_19_20_SAF_March_11.rds.” The “.rds” extension marks R’s own format for storing a single object, similar to how a Word document uses “.docx.”

# saveRDS(NBA_19_20_SAF_March_11, "~/Downloads/NBA_19_20_SAF_March_11.rds", compress = TRUE)
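If you would like to test the save-and-load round trip before touching your real data, here is a minimal sketch. It writes a made-up `toy` dataframe to a temporary file (so nothing lands in your Downloads folder) and reads it back with readRDS(), the partner function of saveRDS().

```r
# Toy dataframe standing in for the real one
toy <- data.frame(PlayTime = c(0, 716), HomeScore = c(0, 24))

# Temporary file path with the .rds extension
rds_path <- tempfile(fileext = ".rds")

# Save the object, then load it back into a new name
saveRDS(toy, rds_path)
restored <- readRDS(rds_path)

# Check that the loaded copy matches the original
identical(toy, restored)
```

Unlike the Environment tab, the saved file survives closing RStudio, so next time you can start from readRDS() instead of rerunning every step.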

Once saved, you can close RStudio as we learned in the previous section.

End of Part 2!!

I’m so proud that you’ve made it this far!

Quick review of what we’ve done in Parts 1 and 2.

  1. Introduction (Hello World!)
  2. Install R-studio

– We successfully installed R-studio and familiarized ourselves with its interface.

  3. Yes, you are ready
  4. Data preparation + visualization

– We successfully learned how to visualize data as desired (including data cleaning and manipulation).

For Part 3, we’ll dive into why the Hawks lost.

  5. Analysis of the observations
  6. Good luck :)

Here are some key points for you to consider - these will be great review questions:

  1. How can we analyze all of the Philadelphia 76ers’ games?
  2. How can we improve our final plot?
  3. Are there other ways to format the data besides the method we used?